Abstract

We develop a new tool for data-dependent analysis of the exploration-exploitation trade-off
in learning under limited feedback. Our tool is based on two main ingredients. The first 
ingredient is a new concentration inequality that makes it possible to control the concentra- 
tion of weighted averages of multiple (possibly uncountably many) simultaneously evolving 
and interdependent martingales.¹ The second ingredient is an application of this inequality
to the exploration-exploitation trade-off via importance weighted sampling. We apply the 
new tool to the stochastic multiarmed bandit problem; however, the main importance of
this paper is the development and understanding of the new tool rather than improvement
of existing algorithms for stochastic multiarmed bandits. In follow-up work we demonstrate
that the new tool can improve over the state of the art in structurally richer problems,
such as stochastic multiarmed bandits with side information (Seldin et al., 2011a).
Keywords: PAC-Bayesian Analysis, Bernstein's Inequality, Martingales, Multiarmed Ban- 
dits, Model Order Selection, Exploration-Exploitation Trade-off 



1. Introduction 

Learning under limited feedback and the exploration-exploitation trade-off are fundamental
questions in fields like reinforcement and active learning. The existing theoretical
analysis of the exploration-exploitation trade-off in problems that go beyond multiarmed
bandits is mainly focused on worst-case scenarios (Strehl et al., 2009; Jaksch et al.,
2010; Beygelzimer et al., 2011, 2009). However, worst-case analysis is overly pessimistic
when the environment is not adversarial, and it cannot exploit the opportunities provided
by benign conditions. We present a new analysis framework that lays the foundation for
data-dependent analysis of the exploration-exploitation trade-off.

1. See also our follow-up work on PAC-Bayesian inequalities for martingales (Seldin et al., 2011b) 
Our framework is based on PAC-Bayesian analysis. The PAC-Bayesian analysis was
introduced over a decade ago (Shawe-Taylor and Williamson, 1997; Shawe-Taylor et al.,
1998; McAllester, 1998; Seeger, 2002) and has since made a significant contribution to the 
analysis and development of supervised learning methods. PAC-Bayesian bounds provide 
an explicit and often intuitive and easy-to-optimize trade-off between model complexity and 
empirical data fit, where the complexity can be nailed down to the resolution of individ- 
ual hypotheses via the definition of the prior. The PAC-Bayesian analysis was applied to 
derive generalization bounds and new algorithms for linear classifiers and maximum margin
methods (Langford and Shawe-Taylor, 2002; McAllester, 2003; Germain et al., 2009),
structured prediction (McAllester, 2007), and clustering-based classification models (Seldin 
and Tishby, 2010), to name just a few. However, the application of PAC-Bayesian analysis 
beyond the supervised learning domain remained surprisingly limited. In fact, the only 
additional domain known to us is density estimation (Seldin and Tishby, 2010; Higgs and 
Shawe-Taylor, 2010).

Application of PAC-Bayesian analysis to non-i.i.d. data was partially addressed only 
recently by Ralaivola et al. (2010) and Lever et al. (2010). The solution of Ralaivola et al. is 
based on breaking the sample into independent (or almost independent) subsets (which also 
reduces the effective sample size to the number of independent subsets). Such an approach 
is inapplicable in reinforcement learning due to strong dependence of the learning process 
on all of its history. Lever et al. treated dependent samples in the context of analysis of U- 
statistics. They employed Hoeffding's canonical decomposition of U-statistics into forward 
martingales and applied PAC-Bayesian analysis directly to these martingales. The approach 
presented here is both tighter and more general. 

We present a generalization of PAC-Bayesian analysis to martingales. Our generaliza- 
tion makes it possible to consider model order selection simultaneously with the exploration- 
exploitation trade-off. Some potential advantages of applying PAC-Bayesian analysis in re- 
inforcement learning were recently pointed out by several researchers, including Tishby and 
Polani (2010) and Fard and Pineau (2010). Tishby and Polani suggested using the mutual
information between states and actions in a policy as a natural regularizer in reinforcement 
learning. They showed that regularization by mutual information can be incorporated into 
Bellman equations and thereby computed efficiently. Tishby and Polani conjectured that 
PAC-Bayesian analysis can be applied to justify such a regularization and provide general- 
ization guarantees for it. 

Fard and Pineau derived a PAC-Bayesian analysis of batch reinforcement learning. How- 
ever, batch reinforcement learning does not involve the exploration-exploitation trade-off. 

One of the reasons for the difficulty of applying PAC-Bayesian analysis to address the 
exploration-exploitation trade-off is limited feedback (the fact that we only observe the 
reward for the action taken, but not for all other actions). In supervised learning (and also 
in density estimation) the empirical error of each hypothesis in a hypotheses class can be 
evaluated on all the samples and, therefore, the size of the sample available for evaluation 
of all the hypotheses is the same (and usually relatively large). In the situation of limited 
feedback the samples from one action cannot be used to evaluate another action and the 
sample size of "bad" actions has to increase sublinearly in the number of game rounds. 
In a precursory report (Seldin et al., 2011c) we overcame this difficulty by applying PAC- 
Bayesian analysis to importance weighted sampling (Sutton and Barto, 1998). Importance 
weighted sampling is commonly used in the analysis of non-stochastic bandits (Auer et al.,
2002b), but has not previously been applied to the analysis of stochastic bandits. 

The usage of importance weighted sampling introduces two new difficulties. One is
sequential dependence of the samples: the rewards observed in the past influence the
distribution over actions played in the future and, through this distribution, the variance
of the subsequent weighted sample variables. The second problem introduced by weighted
sampling is the growing variance of the weighted sample variables. In Seldin et al. (2011c)
we handled this dependence by combining PAC-Bayesian analysis with Hoeffding-Azuma-type
inequalities for martingales. The bounds achieved by such a combination provide an
$O\left(\frac{1}{\varepsilon_t\sqrt{t}}\right)$ convergence rate, where $t$ is the time step and
$\varepsilon_t$ is the minimal probability of sampling any action at time step $t$. The
combination with a Bernstein-type inequality for martingales presented here achieves an
$O\left(\frac{1}{\sqrt{t\varepsilon_t}}\right)$ convergence rate. This improvement makes it
possible to tighten the regret bounds from $O(K^{1/2}t^{3/4})$ to $O(K^{1/3}t^{2/3})$, where
$K$ is the number of arms. In Section 3 we suggest possible ways to tighten the analysis
further to get $O(\sqrt{Kt})$ regret bounds. These further improvements will be studied in
detail in future work.
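The link between convergence rates and regret bounds can be sketched by a rough balancing argument. The sketch below rests on simplifying assumptions that are ours: logarithmic factors and constants are ignored, the cost of forced exploration is taken to be at most $K\varepsilon_t$ per round, and the Hoeffding-Azuma rate is taken to be $1/(\varepsilon_t\sqrt{t})$ and the Bernstein rate $1/\sqrt{t\varepsilon_t}$.

```latex
\begin{align*}
\text{Hoeffding-Azuma:}\quad
  & K\varepsilon_t = \frac{1}{\varepsilon_t\sqrt{t}}
    \;\Longrightarrow\; \varepsilon_t = K^{-1/2}t^{-1/4}, \\
  & \text{per-round regret } O\!\left(K^{1/2}t^{-1/4}\right),
    \quad \text{total regret } O\!\left(K^{1/2}t^{3/4}\right); \\[4pt]
\text{Bernstein:}\quad
  & K\varepsilon_t = \frac{1}{\sqrt{t\varepsilon_t}}
    \;\Longrightarrow\; \varepsilon_t = K^{-2/3}t^{-1/3}, \\
  & \text{per-round regret } O\!\left(K^{1/3}t^{-1/3}\right),
    \quad \text{total regret } O\!\left(K^{1/3}t^{2/3}\right).
\end{align*}
```

In both cases the exploration cost and the concentration term are equated, which is the usual heuristic for choosing the exploration rate; it is not a substitute for the formal statement in Theorem 4.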

We repeat that our main goal is not improvement of existing bounds for stochastic
multiarmed bandits, which are already tight up to $\sqrt{\ln K}$ factors (Audibert and Bubeck,
2009; Auer and Ortner, 2010), but rather development of a new powerful tool for reinforcement
learning and for other domains with richer structure. The multiarmed bandits
serve us as a testbed for the development of this new tool. One example of a problem
with a richer structure is multiarmed bandits with side information (a.k.a. contextual
bandits). Beygelzimer et al. (2011) suggested $O\left(\sqrt{Kt\ln(N/\delta)}\right)$ and
$O\left(\sqrt{t(d\ln t - \ln\delta)}\right)$ regret bounds for learning with expert advice in
multiarmed bandits with side information, where $N$ is the number of experts (in case it is
finite) and $d$ is the VC-dimension of the set of experts (in case it is infinite). In the
follow-up paper (Seldin et al., 2011a) we show that PAC-Bayesian analysis makes it possible
to replace the $\ln(N)$ and $d$ factors with $\mathrm{KL}(\rho\|\mu)$, where KL is the
KL-divergence, $\rho(h)$ is a distribution over the experts played by the algorithm, and
$\mu(h)$ is a prior distribution over the experts. Such an approach is much more flexible,
since it allows individual treatment of different experts (or policies) via the definition
of the prior $\mu$.

The paper is organized as follows: Section 2 surveys the main results of the paper. 
Section 3 suggests possible ways to tighten the analysis further, and Section 4 discusses the 
results. Proofs are provided in the appendix. 

2. Main Results 

We start with a general concentration result for martingales based on combination of PAC- 
Bayesian analysis with a Bernstein-type inequality for martingales. Then, we apply this 
result to derive an instantaneous (per-round) bound on the distance between expected and 
empirical regret for the multiarmed bandit problem. This result is in turn applied to derive 
an instantaneous regret bound for the multiarmed bandits. 
2.1. PAC-Bayes-Bernstein Inequality for Martingales

In order to present our concentration result for martingales we need a few definitions. Let
$\mathcal{H}$ be an index (or a hypothesis) space, possibly uncountably infinite. Let
$\{X_1(h), X_2(h), \dots : h \in \mathcal{H}\}$ be martingale difference sequences, meaning that
$\mathbb{E}[X_t(h) \mid \mathcal{T}_{t-1}] = 0$, where
$\mathcal{T}_t = \{X_\tau(h) : 1 \le \tau \le t \text{ and } h \in \mathcal{H}\}$ is a set of
martingale differences observed up to time $t$ (the history). ($\{X_t(h)\}_{h\in\mathcal{H}}$ do
not have to be independent, we only need the requirement on the conditional expectation to be
satisfied.) Let $M_t(h) = \sum_{\tau=1}^{t} X_\tau(h)$ be martingales corresponding to the
martingale difference sequences and let
$V_t(h) = \sum_{\tau=1}^{t} \mathbb{E}[X_\tau(h)^2 \mid \mathcal{T}_{\tau-1}]$ be cumulative
variances of the martingales. For a distribution $\rho$ over $\mathcal{H}$ define weighted
averages of the martingales and their cumulative variances with respect to $\rho$ as
$M_t(\rho) = \mathbb{E}_{\rho(h)}[M_t(h)]$ and $V_t(\rho) = \mathbb{E}_{\rho(h)}[V_t(h)]$.

Theorem 1 (PAC-Bayes-Bernstein Inequality) Let $\{C_1, C_2, \dots\}$ be an increasing
sequence set in advance, such that $|X_t(h)| \le C_t$ for all $h$ with probability 1. Let
$\{\mu_1, \mu_2, \dots\}$ be a sequence of "reference" ("prior") distributions over
$\mathcal{H}$, such that $\mu_t$ is independent of $\mathcal{T}_t$ (but can depend on $t$). Let
$\{\lambda_1, \lambda_2, \dots\}$ be a sequence of positive numbers set in advance that satisfy:

$$\lambda_t \le \frac{1}{C_t}. \tag{1}$$

Then for all possible distributions $\rho_t$ over $\mathcal{H}$ given $t$ and for all $t$
simultaneously with probability greater than $1 - \delta$:

$$|M_t(\rho_t)| \le \frac{\mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t} + (e-2)\lambda_t V_t(\rho_t). \tag{2}$$



Bound (2) is minimized by
$\lambda_t = \sqrt{\frac{\mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta}}{(e-2)V_t(\rho_t)}}$.
For this value of $\lambda_t$ we would get

$$|M_t(\rho_t)| \le 2\sqrt{(e-2)V_t(\rho_t)\left(\mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta}\right)}; \tag{3}$$

however, $\lambda_t$ has to be set in advance and cannot depend on the sample. Therefore, we have
to make our best guess of what the values of $\mathrm{KL}(\rho_t\|\mu_t)$ and $V_t(\rho_t)$ are
going to be, which is actually possible in the case that we study below. In the follow-up paper
we show that by taking an exponentially spaced grid of $\lambda_t$-s and a union bound over this
grid it is possible to derive a bound which is almost as good as (3) (Seldin et al., 2011b), but
this extension is not required in the current work.
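The trade-off in the choice of $\lambda_t$ can be illustrated numerically. The following is a minimal sketch, assuming the bound has the form (KL + 2 ln(t+1) + ln(2/δ))/λ_t + (e−2)λ_t V_t and that its minimized value is 2·sqrt((e−2)·V_t·(KL + 2 ln(t+1) + ln(2/δ))); the function names and the numeric values are illustrative and do not come from the paper.

```python
import math

E_MINUS_2 = math.e - 2.0


def complexity(kl, t, delta):
    """KL + 2 ln(t+1) + ln(2/delta): the numerator of the first term of the bound."""
    return kl + 2.0 * math.log(t + 1) + math.log(2.0 / delta)


def bound_fixed_lambda(kl, variance, t, delta, lam):
    """Right-hand side of the bound for a lambda_t that was fixed in advance."""
    return complexity(kl, t, delta) / lam + E_MINUS_2 * lam * variance


def minimizing_lambda(kl, variance, t, delta):
    """The lambda_t minimizing the bound; it requires KL and V_t, hence a 'best guess'."""
    return math.sqrt(complexity(kl, t, delta) / (E_MINUS_2 * variance))


def bound_optimized(kl, variance, t, delta):
    """Value of the bound at the minimizing lambda_t."""
    return 2.0 * math.sqrt(E_MINUS_2 * variance * complexity(kl, t, delta))


if __name__ == "__main__":
    # Illustrative (hypothetical) values for KL, V_t, t and delta.
    kl, variance, t, delta = math.log(2), 500.0, 1000, 0.05
    lam_star = minimizing_lambda(kl, variance, t, delta)
    for lam in (0.5 * lam_star, lam_star, 2.0 * lam_star):
        print(f"lambda={lam:.4f}  bound={bound_fixed_lambda(kl, variance, t, delta, lam):.3f}")
    print(f"optimized bound={bound_optimized(kl, variance, t, delta):.3f}")
```

Misjudging $\lambda_t$ by a factor of two in either direction inflates the bound by a factor of 1.25, which gives a feel for how much is lost by guessing the KL and variance terms in advance.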

2.2. Application to the Multiarmed Bandit Problem 

In order to apply our result to the multiarmed bandit problem we need some more definitions.
Let $\mathcal{A}$ be a set of actions (arms) of size $|\mathcal{A}| = K$ and let
$a \in \mathcal{A}$ denote the actions. Denote by $R(a)$ the expected reward of action $a$. Let
$\pi_t$ be a distribution over $\mathcal{A}$ that is played at round $t$ of the game (a policy).
Let $\{A_1, A_2, \dots\}$ be the sequence of actions played independently at random according to
$\{\pi_1, \pi_2, \dots\}$ respectively. Let $\{R_1, R_2, \dots\}$ be the sequence of observed
rewards. Denote by
$\mathcal{T}_t = \{\{\pi_1, \dots, \pi_t\}, \{A_1, \dots, A_t\}, \{R_1, \dots, R_t\}\}$
the set of played policies, taken actions, and observed rewards up to round $t$.
For $t \ge 1$ and $a \in \{1, \dots, K\}$ define a set of random variables $R_t^a$ (the
importance weighted samples):

$$R_t^a = \begin{cases} \frac{1}{\pi_t(a)} R_t, & \text{if } A_t = a, \\ 0, & \text{otherwise.} \end{cases}$$

Define:

$$\hat R_t(a) = \frac{1}{t}\sum_{\tau=1}^{t} R_\tau^a.$$

Observe that $\mathbb{E}[R_t^a \mid \mathcal{T}_{t-1}] = R(a)$ and $\mathbb{E}\hat R_t(a) = R(a)$.
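The unbiasedness of the importance weighted samples can be checked with a short simulation. This is a sketch under assumptions of ours: rewards are Bernoulli, the policy is held fixed over rounds, and all function names are hypothetical.

```python
import random


def importance_weighted_sample(reward, played_action, action, policy):
    """Importance weighted sample: reward / pi(a) if the played action is a, else 0."""
    if played_action == action:
        return reward / policy[action]
    return 0.0


def conditional_mean(action, policy, mean_rewards):
    """Exact conditional expectation of the weighted sample for a given policy."""
    # Sum over the action that may be played; only played == action contributes.
    return sum(
        policy[played]
        * importance_weighted_sample(mean_rewards[played], played, action, policy)
        for played in range(len(policy))
    )


def simulate_estimate(action, policy, mean_rewards, rounds, seed=0):
    """Average of the weighted samples over simulated rounds (Bernoulli rewards)."""
    rng = random.Random(seed)
    arms = list(range(len(policy)))
    total = 0.0
    for _ in range(rounds):
        played = rng.choices(arms, weights=policy)[0]
        reward = 1.0 if rng.random() < mean_rewards[played] else 0.0
        total += importance_weighted_sample(reward, played, action, policy)
    return total / rounds


if __name__ == "__main__":
    policy = [0.2, 0.8]          # hypothetical fixed sampling distribution
    mean_rewards = [0.5, 0.6]    # expected rewards R(a)
    for a in range(2):
        print(a, conditional_mean(a, policy, mean_rewards),
              simulate_estimate(a, policy, mean_rewards, rounds=20000))
```

The estimate for the rarely played arm fluctuates more, because its nonzero samples are scaled by a larger factor 1/π(a); this is the growing-variance effect that the Bernstein-type analysis has to control.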

Let a* be the "best" action (the action with the highest expected reward, if there are 
multiple "best" actions pick any of them). Define the expected and empirical per-round 
regrets as: 

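$$\Delta(a) = R(a^*) - R(a),$$
$$\hat\Delta_t(a) = \hat R_t(a^*) - \hat R_t(a).$$

Observe that $t(\hat\Delta_t(a) - \Delta(a))$ form a martingale. Let

$$V_t(a) = \sum_{\tau=1}^{t} \mathbb{E}\left[\left([R_\tau^{a^*} - R_\tau^{a}] - [R(a^*) - R(a)]\right)^2 \,\middle|\, \mathcal{T}_{\tau-1}\right]$$

be the cumulative variance of this martingale.

Let $\{\varepsilon_1, \varepsilon_2, \dots\}$ be a decreasing sequence that satisfies
$\varepsilon_t \le \min_a \pi_t(a)$ (we say that $\pi_t(a)$ is bounded from below by
$\varepsilon_t$). In the appendix we prove the following upper bound on $V_t(a)$.

Lemma 2 For all $t$ and $a$:

$$V_t(a) \le \frac{2t}{\varepsilon_t}.$$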

For a distribution $\rho$ over $\mathcal{A}$ define the expected and empirical regret of $\rho$
as $\Delta(\rho) = \mathbb{E}_{\rho(a)}[\Delta(a)]$ and
$\hat\Delta_t(\rho) = \mathbb{E}_{\rho(a)}[\hat\Delta_t(a)]$. The following theorem follows
immediately from Theorem 1 and Lemma 2 by taking a uniform prior over the actions.

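Theorem 3 For any sequence of sampling distributions $\{\pi_1, \pi_2, \dots\}$ that are bounded
from below by a decreasing sequence $\{\varepsilon_1, \varepsilon_2, \dots\}$ that satisfies

$$\frac{\ln(K) + 2\ln(t+1) + \ln\frac{2}{\delta}}{2(e-2)t} \le \varepsilon_t, \tag{4}$$

where $\pi_t$ can depend on $\mathcal{T}_{t-1}$, for all possible distributions $\rho_t$ given
$t$ and for all $t \ge 1$ simultaneously with probability greater than $1 - \delta$:

$$|\Delta(\rho_t) - \hat\Delta_t(\rho_t)| \le 2\sqrt{\frac{2(e-2)\left(\ln(K) + 2\ln(t+1) + \ln\frac{2}{\delta}\right)}{t\varepsilon_t}}. \tag{5}$$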
Proof For a uniform prior $\mu_t(a) = \frac{1}{K}$ we have
$\mathrm{KL}(\rho_t\|\mu_t) \le \ln(K)$. By Lemma 2, for any $\rho_t$ the weighted cumulative
variance is bounded by $V_t(\rho_t) \le \frac{2t}{\varepsilon_t}$. By taking

$$\lambda_t = \sqrt{\frac{\left(\ln(K) + 2\ln(t+1) + \ln\frac{2}{\delta}\right)\varepsilon_t}{2(e-2)t}}$$

and substituting the bounds on $\mathrm{KL}(\rho_t\|\mu_t)$ and $V_t(\rho_t)$ into (2) we obtain
(5). (We considered the martingales $t(\hat\Delta_t(a) - \Delta(a))$, which provided a factor of
$t$ in the denominator.) The technical condition (4) follows from the requirement (1) on
$\lambda_t$. ■
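The substitution described in the proof can be verified numerically. The sketch below assumes a uniform prior (KL at most ln K), the variance bound 2t/ε_t, and λ_t = sqrt(B·ε_t / (2(e−2)t)) with B = ln K + 2 ln(t+1) + ln(2/δ); under these assumptions the per-round bound comes out as 2·sqrt(2(e−2)B / (t·ε_t)). Function names and the example parameters are hypothetical.

```python
import math

E_MINUS_2 = math.e - 2.0


def complexity_term(num_arms, t, delta):
    """B = ln K + 2 ln(t+1) + ln(2/delta)."""
    return math.log(num_arms) + 2.0 * math.log(t + 1) + math.log(2.0 / delta)


def technical_condition_holds(num_arms, t, delta, eps_t):
    """Checks B / (2(e-2)t) <= eps_t, i.e. that lambda_t does not exceed eps_t."""
    return complexity_term(num_arms, t, delta) / (2.0 * E_MINUS_2 * t) <= eps_t


def closed_form_bound(num_arms, t, delta, eps_t):
    """Per-round deviation bound 2 * sqrt(2(e-2) B / (t eps_t))."""
    b = complexity_term(num_arms, t, delta)
    return 2.0 * math.sqrt(2.0 * E_MINUS_2 * b / (t * eps_t))


def bound_by_substitution(num_arms, t, delta, eps_t):
    """Same bound, obtained by plugging lambda_t, KL and V_t into the general form."""
    b = complexity_term(num_arms, t, delta)
    lam = math.sqrt(b * eps_t / (2.0 * E_MINUS_2 * t))
    variance = 2.0 * t / eps_t
    # Dividing by t converts the bound on the martingale into a per-round bound.
    return (b / lam + E_MINUS_2 * lam * variance) / t


if __name__ == "__main__":
    K, delta = 2, 0.05
    for t in (10, 10_000):
        eps_t = K ** (-2.0 / 3.0) * t ** (-1.0 / 3.0)
        print(t, technical_condition_holds(K, t, delta, eps_t),
              closed_form_bound(K, t, delta, eps_t),
              bound_by_substitution(K, t, delta, eps_t))
```

For small $t$ the technical condition fails and the bound is vacuous (larger than 1), which matches the remark that the condition only holds once $t$ is sufficiently large.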

Remarks: Theorem 3 provides an improvement over the corresponding Theorems 2
and 3 in the precursory report (Seldin et al., 2011c) by decreasing the dependence on
$\varepsilon_t$ from $1/\varepsilon_t$ to $1/\sqrt{\varepsilon_t}$. This in turn makes it
possible to improve the regret bound, which is shown next. Interestingly, the uniform prior
$\mu_t$ yields a tighter (and also simpler) bound than a distribution-dependent prior used in
Seldin et al. (2011c). It also broadens the
range of playing strategies for which the regret bound given in Theorem 4 holds. We 
note that the uniform prior neutralizes the power of PAC-Bayesian analysis to discriminate 
between different hypotheses. For problems with richer structure studied in the follow-up 
paper (Seldin et al., 2011a), more interesting priors can be defined that yield advantages 
over alternative approaches. The multiarmed bandit problem studied here is, nevertheless, 
important for the development of the new tool. 

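We note that in the next theorem we take $\varepsilon_t = K^{-2/3}t^{-1/3}$ and the technical
condition (4) is satisfied for $t$ that is slightly larger than
$K\left(\ln(K) + \ln\frac{2}{\delta}\right)^{3/2}$.

Theorem 4 Let $\varepsilon_t = K^{-2/3}t^{-1/3}$ and take any $\gamma_t$, such that
$\gamma_t \ge K^{-1/3}t^{1/3}\sqrt{\ln K}$. For $t < K$ let $\pi_t(a) = \frac{1}{K}$ for all $a$
and for $t \ge K$ let

$$\pi_{t+1}(a) = \tilde\rho_t^{exp}(a) = (1 - K\varepsilon_{t+1})\rho_t^{exp}(a) + \varepsilon_{t+1},$$

where

$$\rho_t^{exp}(a) = \frac{e^{\gamma_t \hat R_t(a)}}{Z(\rho_t^{exp})}$$

and

$$Z(\rho_t^{exp}) = \sum_a e^{\gamma_t \hat R_t(a)}.$$

Then the expected per-round regret
$\Delta(\tilde\rho_t^{exp}) = R(a^*) - R(\tilde\rho_t^{exp})$ is bounded by:

$$\Delta(\tilde\rho_t^{exp}) \le \frac{K^{1/3}}{t^{1/3}}\left(1 + \sqrt{\ln K} + 2\sqrt{2(e-2)\left(\ln(K) + 2\ln(t+1) + \ln\frac{2}{\delta}\right)}\right)$$

with probability greater than $1 - \delta$ simultaneously for all rounds $t$, where $t$
satisfies (4) (which means that
$t \ge K\left(\frac{\ln(K) + 2\ln(t+1) + \ln\frac{2}{\delta}}{2(e-2)}\right)^{3/2}$; note that
$t$ also appears on the right hand side). This translates into a total regret of
$\tilde O(K^{1/3}t^{2/3})$ (where $\tilde O$ hides logarithmic factors).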

For $\gamma_t = \sqrt{t\ln K/K}$ the playing strategy in Theorem 4 is known as the EXP3
algorithm for adversarial bandits (Auer et al., 2002b), which is applied here to stochastic
bandits. When $\gamma_t$ tends to infinity, we obtain the $\varepsilon$-greedy algorithm for
stochastic bandits (Auer et al., 2002a). Theorem 4 covers the spectrum of all possible
intermediate strategies.
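The playing strategy can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the strategy is taken to be an exponential-weights distribution over the empirical importance weighted rewards, smoothed as (1 − Kε)ρ(a) + ε, with ε_t = K^(−2/3) t^(−1/3) and the smallest admissible γ_t = K^(−1/3) t^(1/3) sqrt(ln K). The function names and the example reward estimates are hypothetical.

```python
import math


def exponential_weights(empirical_rewards, gamma):
    """Distribution proportional to exp(gamma * empirical reward of each arm)."""
    # Subtracting the maximum leaves the distribution unchanged and avoids overflow.
    top = max(empirical_rewards)
    weights = [math.exp(gamma * (r - top)) for r in empirical_rewards]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def smoothed_policy(empirical_rewards, gamma, eps):
    """Epsilon-smoothed distribution: every arm keeps probability at least eps."""
    k = len(empirical_rewards)
    rho = exponential_weights(empirical_rewards, gamma)
    return [(1.0 - k * eps) * p + eps for p in rho]


def theorem4_parameters(num_arms, t):
    """eps_t and the smallest gamma_t allowed by the assumed parameter choice."""
    eps = num_arms ** (-2.0 / 3.0) * t ** (-1.0 / 3.0)
    gamma = num_arms ** (-1.0 / 3.0) * t ** (1.0 / 3.0) * math.sqrt(math.log(num_arms))
    return eps, gamma


if __name__ == "__main__":
    estimates = [0.2, 0.5, 0.9]   # hypothetical empirical rewards of three arms
    for t in (100, 10_000):
        eps, gamma = theorem4_parameters(len(estimates), t)
        print(t, eps, gamma, smoothed_policy(estimates, gamma, eps))
```

A larger γ_t concentrates the distribution on the empirically best arm, so in the limit the smoothed policy behaves like an ε-greedy rule, while the smoothing term keeps the importance weights bounded by 1/ε.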
3. Towards a Tighter Regret Bound

We note that there is still room for improvement, which we believe will make it possible to
achieve regret bounds of order $O(\sqrt{Kt})$. The main source of looseness is the usage of the
crude global upper bound $\frac{2t}{\varepsilon_t}$ on the cumulative variances in Lemma 2 that
holds for any distribution $\rho_t$. While this bound seems to be tight for the
$\varepsilon$-greedy strategy, we believe that it can be tightened for the EXP3 algorithm. It is
possible to show that if we play according to the distributions
$\{\tilde\rho_1^{exp}, \dots, \tilde\rho_t^{exp}\}$, then for "good" actions $a$ (those for
which $\Delta(a)$ is below a certain threshold) the cumulative variance $V_t(a)$ is bounded by
$CKt$ for some constant $C$. If we could show that for "bad" actions $a$ (those for which
$\Delta(a)$ is above this threshold) the probability $\tilde\rho_t^{exp}(a)$ of picking such
actions is bounded by $C\varepsilon_t$, then the cumulative variance $V_t(\tilde\rho_t^{exp})$
would be bounded by $CKt$. This is, in fact, true for "very bad" actions (those for which
$\Delta(a)$ is close to 1), but it does not hold for actions with $\Delta(a)$ close to the
threshold. However, we can possibly show that for such actions
$\tilde\rho_\tau^{exp}(a) \le C\varepsilon_t$ for most of the rounds (a $1 - \varepsilon_t$
fraction will suffice), and then we will be able to achieve $O(\sqrt{Kt})$ regret. In the
experiment that follows we provide empirical evidence that this conjecture holds in practice.

Another possible approach is to apply the EXP3.P algorithm of Auer et al. (2002b).
However, in the experiment that follows we show that in the stochastic setting the EXP3
algorithm achieves much lower regret than EXP3.P. It is, therefore, worth exploring the first
route. We also note that Auer et al. (2002b) do not provide an explicit bound on the
variance of EXP3.P, which is required for our bound. This would have to be done for the
second way of achieving an $O(\sqrt{Kt})$ regret bound.

3.1. Empirical Test Study 

In the following experiment we show that in the stochastic setting the EXP3 algorithm achieves
lower regret than the EXP3.P.1 algorithm of Auer et al. (2002b). We also show that
the variance of the EXP3 algorithm is reasonably close to $2Kt$. Finally, we show that in the
stochastic setting the regret of the EXP3 algorithm is comparable to or even lower than the
regret of the UCB strategy (Auer et al., 2002a) in the short run, but gets worse in the long
run. We
note that UCB strategy is not compatible with PAC-Bayesian analysis, since in UCB every 
action has its own sample size and the sample size of "bad" actions grows sublinearly with 
the number of game rounds. Designing a strategy that would be compatible with PAC- 
Bayesian analysis and achieve the regret of UCB in the long run is an important direction 
for future research. 

Experiment Setup 

We took a 2-arm bandit problem with biases 0.5 and 0.6 for the two arms and ran the EXP3
algorithm from Theorem 4 with $\varepsilon_t = 1/\sqrt{Kt}$ and $\gamma_t = \sqrt{t\ln K/K}$,
the EXP3.P.1 algorithm of Auer et al. (2002b) with $\delta = 0.001$, and the UCB1 algorithm of
Auer et al. (2002a). In the first experiment we made 1000 repetitions of the game and in each
game we ran each of the algorithms for 10,000 rounds. In the second experiment we made 100
repetitions of the game and in each game we ran each of the algorithms for $10^7$ rounds. In
Figure 1 we show:

1.a Experiment 1 ($10^4$ rounds): Average (over 1000 repetitions of the game) cumulative
regret of the EXP3, EXP3.P.1, and UCB1 algorithms.
Figure 1: Experimental results. Solid lines show mean values over experiment repetitions,
dotted lines show mean values plus one standard deviation (std).



1.b Experiment 1: Average cumulative variance of EXP3 and EXP3.P.1 normalized by
$2Kt$, which is what we would like it to be:
$\frac{1}{2Kt}\cdot\frac{1}{1000}\sum_{i=1}^{1000} V_t^i(\rho_t)$, where
$i \in [1, \dots, 1000]$ indexes the experiments.

1.c Experiment 2 ($10^7$ rounds): Average (over 100 repetitions of the game) cumulative
regret of the EXP3 and UCB1 algorithms. The regret of the EXP3.P.1 algorithm was far
above the regret of EXP3 and UCB1 and, therefore, was omitted from the graphs.

1.d Experiment 2: Average cumulative variance of EXP3 normalized by $2Kt$.
Observations 

1. In the stochastic setting the performance of EXP3 is significantly superior to the 
performance of EXP3.P.1. 

2. In the stochastic setting, the performance of EXP3 is comparable or even superior to
the performance of UCB1 in the short run, but becomes worse than the performance
of UCB1 in the long run (beyond $2\cdot10^6$ iterations). The reason is that the number
of pulls of the suboptimal arm is roughly $\sqrt{t}$ for EXP3 and $\ln(t)/\Delta(a)^2$ for
UCB. In our experiment $\Delta(a) = 0.1$ for the suboptimal arm, thus
$\sqrt{t} > \ln(t)/\Delta(a)^2$ when $t > \ln(t)^2/\Delta(a)^4$, which holds when
$t > 2\cdot10^6$.
3. In the stochastic setting, the variance of EXP3 is initially higher than the variance of
EXP3.P.1, but eventually it becomes lower. 

4. Initially the variance of EXP3 is just slightly above 2Kt (by a factor of less than 2) 
and eventually it stabilizes around 0.66 • 2Kt for the problem that we considered. 
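The crossover point between the two pull counts mentioned in the second observation can be located numerically. The sketch below assumes the rough pull counts sqrt(t) for EXP3 and ln(t)/Δ² for UCB with Δ = 0.1, and searches a grid of round numbers; the function names, grid limits and step are our own choices.

```python
import math


def exp3_suboptimal_pulls(t):
    """Rough number of pulls of the suboptimal arm for EXP3: sqrt(t)."""
    return math.sqrt(t)


def ucb_suboptimal_pulls(t, gap):
    """Rough number of pulls of the suboptimal arm for UCB: ln(t) / gap^2."""
    return math.log(t) / gap ** 2


def crossover_round(gap, start=10**5, stop=10**7, step=10**4):
    """First grid point t >= start at which sqrt(t) exceeds ln(t) / gap^2."""
    # The search starts well above t = 1, where sqrt(t) > ln(t)/gap^2 holds trivially.
    for t in range(start, stop + 1, step):
        if exp3_suboptimal_pulls(t) > ucb_suboptimal_pulls(t, gap):
            return t
    return None


if __name__ == "__main__":
    t_cross = crossover_round(gap=0.1)
    print("sqrt(t) overtakes ln(t)/gap^2 near t =", t_cross)
```

With a gap of 0.1 the crossover lands slightly above two million rounds, in line with the round count quoted in the second observation.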

4. Discussion 

We presented a new framework for data-dependent analysis of the exploration-exploita- 
tion trade-off and for simultaneous analysis of model order selection and the exploration- 
exploitation trade-off. We note that model order selection does not come up in the mul- 
tiarmed bandit problem due to simplicity of the structure of this problem. Nevertheless, 
the multiarmed bandit problem is a convenient playground for the development of the new 
tool. In the follow-up paper we show that the new technique developed here can be applied 
to multiarmed bandits with side information and yield an advantage over state-of-the-art 
(Seldin et al., 2011a). 

An important direction for future research is to tighten Theorems 3 and 4, so that the 
regret bound will match state-of-the-art regret bounds obtained by alternative techniques. 
We believe that the ideas described in Section 3 can make it possible. The experiments 
presented in Section 3 show that empirically in the stochastic setting our algorithm is 
significantly superior to state-of-the-art algorithms for adversarial bandits and slightly worse 
than state-of-the-art algorithms for stochastic bandits. Closing the gap with state-of-the-art 
algorithms for stochastic bandits is another important direction for future research. 

Other directions for future research include application of our framework to Markov 
decision processes (Fard and Pineau, 2010), active learning (Beygelzimer et al., 2009), and 
problems with continuous state and action spaces, such as Gaussian process bandits (Srini- 
vas et al., 2010). 

Appendix A. Proofs 

In this appendix we provide the proofs of Theorems 1 and 4 and Lemma 2. 
A.1. Proof of Theorem 1

The proof of Theorem 1 relies on the following two lemmas. The first one is a Bernstein-type 
inequality. For a proof of Lemma 5 see, for example, the proof of Theorem 1 in Beygelzimer 
et al. (2011). 

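Lemma 5 (Bernstein's inequality) Let $X_1, \dots, X_t$ be a martingale difference sequence
(meaning that $\mathbb{E}[X_\tau \mid X_1, \dots, X_{\tau-1}] = 0$ for all $\tau$), such that
$X_\tau \le C$ for all $\tau$ with probability 1. Let $M_t = \sum_{\tau=1}^{t} X_\tau$ be a
corresponding martingale and
$V_t = \sum_{\tau=1}^{t} \mathbb{E}[X_\tau^2 \mid X_1, \dots, X_{\tau-1}]$ be the cumulative
variance of this martingale. Then for any fixed $\lambda \in [0, \frac{1}{C}]$:

$$\mathbb{E}\left[e^{\lambda M_t - (e-2)\lambda^2 V_t}\right] \le 1.$$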

The second lemma originates in statistical physics and information theory (Donsker and 
Varadhan, 1975; Dupuis and Ellis, 1997; Gray, 2011) and forms the basis of PAC-Bayesian 
analysis. See (Banerjee, 2006) for a proof. 
Lemma 6 (Change of measure inequality) For any measurable function $\phi(h)$ on
$\mathcal{H}$ and any distributions $\mu(h)$ and $\rho(h)$ on $\mathcal{H}$, we have:

$$\mathbb{E}_{\rho(h)}[\phi(h)] \le \mathrm{KL}(\rho\|\mu) + \ln\mathbb{E}_{\mu(h)}\left[e^{\phi(h)}\right].$$

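Now we are ready to state the proof of Theorem 1.

Proof of Theorem 1 Take $\phi(h) = \lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)$ and
$\delta_t = \frac{1}{t(t+1)}\delta \ge \frac{1}{(t+1)^2}\delta$. (It is well-known that
$\sum_{t=1}^{\infty}\frac{1}{t(t+1)} = \sum_{t=1}^{\infty}\left(\frac{1}{t} - \frac{1}{t+1}\right) = 1$.)
Then the following holds for all $\rho_t$ and $t$ simultaneously with probability greater than
$1 - \frac{\delta}{2}$:

\begin{align*}
\lambda_t M_t(\rho_t) - (e-2)\lambda_t^2 V_t(\rho_t)
 &= \mathbb{E}_{\rho_t(h)}\left[\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)\right] \tag{6}\\
 &\le \mathrm{KL}(\rho_t\|\mu_t) + \ln\mathbb{E}_{\mu_t(h)}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] \tag{7}\\
 &\le \mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\tfrac{2}{\delta} + \ln\mathbb{E}_{\mathcal{T}_t}\mathbb{E}_{\mu_t(h)}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] \tag{8}\\
 &= \mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\tfrac{2}{\delta} + \ln\mathbb{E}_{\mu_t(h)}\mathbb{E}_{\mathcal{T}_t}\left[e^{\lambda_t M_t(h) - (e-2)\lambda_t^2 V_t(h)}\right] \tag{9}\\
 &\le \mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\tfrac{2}{\delta}, \tag{10}
\end{align*}

where (6) is by definition of $M_t(\rho_t)$ and $V_t(\rho_t)$, (7) is by Lemma 6, (8) holds with
probability greater than $1 - \frac{\delta}{2}$ by Markov's inequality and a union bound over
$t$, (9) is due to the fact that $\mu_t$ is independent of $\mathcal{T}_t$, and (10) is by
Lemma 5.

By applying the same argument to martingales $-M_t(h)$ and taking a union bound over
the two we obtain that with probability greater than $1 - \delta$:

$$|M_t(\rho_t)| \le \frac{\mathrm{KL}(\rho_t\|\mu_t) + 2\ln(t+1) + \ln\frac{2}{\delta}}{\lambda_t} + (e-2)\lambda_t V_t(\rho_t),$$

which is the statement of the theorem. The technical condition (1) follows from the requirement
that $\lambda_t \in [0, \frac{1}{C_t}]$. ■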



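A.2. Proof of Lemma 2
Proof of Lemma 2

\begin{align*}
V_t(a) &= \sum_{\tau=1}^{t}\mathbb{E}\left[\left([R_\tau^{a^*} - R_\tau^{a}] - [R(a^*) - R(a)]\right)^2 \,\middle|\, \mathcal{T}_{\tau-1}\right]\\
 &= \left(\sum_{\tau=1}^{t}\mathbb{E}\left[\left(R_\tau^{a^*} - R_\tau^{a}\right)^2 \,\middle|\, \mathcal{T}_{\tau-1}\right]\right) - t\Delta(a)^2 \tag{11}\\
 &\le \sum_{\tau=1}^{t}\left(\frac{1}{\pi_\tau(a^*)} + \frac{1}{\pi_\tau(a)}\right) \tag{12}\\
 &\le \frac{2t}{\varepsilon_t}, \tag{13}
\end{align*}

where (11) is due to the fact that $\mathbb{E}[R_\tau^a \mid \mathcal{T}_{\tau-1}] = R(a)$, (12)
is due to the fact that $R_t \le 1$ and $t\Delta(a)^2 \ge 0$, and (13) is due to the fact that
$\frac{1}{\pi_\tau(a)} \le \frac{1}{\varepsilon_t}$ for all $a$ and $1 \le \tau \le t$. ■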



A.3. Proof of Theorem 4

Proof of Theorem 4 We use the following regret decomposition:

$$\Delta(\tilde\rho_t^{exp}) = [\Delta(\rho_t^{exp}) - \hat\Delta_t(\rho_t^{exp})] + \hat\Delta_t(\rho_t^{exp}) + [R(\rho_t^{exp}) - R(\tilde\rho_t^{exp})]. \tag{14}$$

The first term in the decomposition is bounded by Theorem 3. Before bounding the
middle term in (14) we bound the last term, which is much simpler, and then return to the
middle term. The bound on $[R(\rho_t^{exp}) - R(\tilde\rho_t^{exp})]$ is achieved by the
following lemma.

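Lemma 7 Let $\tilde\rho$ be an $\varepsilon$-smoothed version of $\rho$, such that

$$\tilde\rho(a) = (1 - K\varepsilon)\rho(a) + \varepsilon.$$

Then

$$R(\rho) - R(\tilde\rho) \le K\varepsilon. \tag{15}$$

Proof

\begin{align*}
R(\rho) - R(\tilde\rho) &= \sum_a (\rho(a) - \tilde\rho(a))R(a)\\
 &\le \frac{1}{2}\sum_a |\rho(a) - \tilde\rho(a)| \tag{16}\\
 &= \frac{1}{2}\sum_a |\rho(a) - (1 - K\varepsilon)\rho(a) - \varepsilon|\\
 &= \frac{1}{2}\sum_a |K\varepsilon\rho(a) - \varepsilon|\\
 &\le \frac{1}{2}K\varepsilon\sum_a \rho(a) + \frac{1}{2}K\varepsilon\\
 &= K\varepsilon.
\end{align*}

In (16) we used the fact that $0 \le R(a) \le 1$ and $\rho$ and $\tilde\rho$ are probability
distributions. ■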
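The smoothing bound is easy to spot-check on random instances. The sketch below assumes rewards in [0, 1], a smoothed distribution of the form (1 − Kε)ρ(a) + ε, and the claim that the expected reward drops by at most Kε; the helper names and the sampling scheme are hypothetical.

```python
import random


def smooth(rho, eps):
    """Epsilon-smoothed version of rho: (1 - K eps) rho(a) + eps."""
    k = len(rho)
    return [(1.0 - k * eps) * p + eps for p in rho]


def expected_reward(dist, rewards):
    """Expected reward of playing according to dist."""
    return sum(p * r for p, r in zip(dist, rewards))


def reward_loss(rho, rewards, eps):
    """Expected reward lost by smoothing: R(rho) - R(smoothed rho)."""
    return expected_reward(rho, rewards) - expected_reward(smooth(rho, eps), rewards)


def random_distribution(k, rng):
    """A random probability vector of length k."""
    weights = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    worst_ratio = 0.0
    for _ in range(1000):
        k = rng.randint(2, 10)
        eps = rng.uniform(0.0, 1.0 / k)
        rho = random_distribution(k, rng)
        rewards = [rng.random() for _ in range(k)]
        if eps > 0.0:
            worst_ratio = max(worst_ratio, reward_loss(rho, rewards, eps) / (k * eps))
    print("largest observed loss / (K eps):", worst_ratio)
```

The observed ratio stays below 1; it approaches 1 only when ρ is concentrated on an arm with reward near 1 while the remaining arms have rewards near 0.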

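In the next lemma we bound $\hat\Delta_t(\rho_t^{exp})$.

Lemma 8

$$\hat\Delta_t(\rho_t^{exp}) \le \frac{\ln K}{\gamma_t}. \tag{17}$$

Proof Observe that by multiplying numerator and denominator in the definition of
$\rho_t^{exp}$ by $e^{-\gamma_t \hat R_t(a^*)}$ we obtain:

$$\rho_t^{exp}(a) = \frac{e^{\gamma_t \hat R_t(a)}}{Z(\rho_t^{exp})} = \frac{e^{-\gamma_t \hat\Delta_t(a)}}{Z'(\rho_t^{exp})},$$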



Seldin Cesa-Bianchi Auer Laviolette Shawe- Taylor 



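where $Z'(\rho_t^{exp}) = \sum_a e^{-\gamma_t \hat\Delta_t(a)}$. The empirical regret
$\hat\Delta_t(\rho_t^{exp})$ then obtains the form:

$$\hat\Delta_t(\rho_t^{exp}) = \frac{\sum_a \hat\Delta_t(a)\,e^{-\gamma_t \hat\Delta_t(a)}}{\sum_a e^{-\gamma_t \hat\Delta_t(a)}}.$$

The lemma follows from Lemma 9 below and the observation that $\hat\Delta_t(a^*) = 0$.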



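Lemma 9 Let $x_1 = 0$ and $x_2, \dots, x_n$ be $n - 1$ arbitrary numbers. For any
$\alpha > 0$ and $n \ge 2$:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{i=1}^{n} e^{-\alpha x_i}} \le \frac{\ln(n)}{\alpha}. \tag{18}$$

Proof Since negative $x_i$-s only decrease the left hand side of (18) we can assume without
loss of generality that all $x_i$-s are positive. Due to symmetry, the maximum is achieved
when all $x_i$-s (except $x_1$) are equal:

$$\frac{\sum_{i=1}^{n} x_i e^{-\alpha x_i}}{\sum_{i=1}^{n} e^{-\alpha x_i}} \le \max_x \frac{(n-1)xe^{-\alpha x}}{1 + (n-1)e^{-\alpha x}}. \tag{19}$$

We apply the change of variables $y = e^{-\alpha x}$, which means that
$x = \frac{1}{\alpha}\ln\frac{1}{y}$. By substituting this into the right hand side of (19) we
get

$$\frac{(n-1)xe^{-\alpha x}}{1 + (n-1)e^{-\alpha x}} = \frac{1}{\alpha}\cdot\frac{(n-1)y\ln\frac{1}{y}}{1 + (n-1)y}.$$

In order to prove the bound we have to show that
$\frac{(n-1)y\ln\frac{1}{y}}{1 + (n-1)y} \le \ln n$.
By taking Taylor's expansion of $\ln z$ around $z = n$ we have:

$$\ln z \le \ln n + \frac{1}{n}(z - n) = \ln n + \frac{z}{n} - 1.$$

Thus:

\begin{align*}
\frac{(n-1)y\ln\frac{1}{y}}{1 + (n-1)y}
 &\le \frac{(n-1)y\left(\ln n + \frac{1}{ny} - 1\right)}{1 + (n-1)y}\\
 &\le \frac{y(n-1)\ln n + \frac{n-1}{n}}{(n-1)y + 1}\\
 &\le \frac{(y(n-1) + 1)\ln n}{y(n-1) + 1} \tag{20}\\
 &= \ln n,
\end{align*}

where (20) follows from the fact that $\ln z \le z - 1$ for any positive $z$, and hence
$\ln\frac{1}{n} \le \frac{1}{n} - 1$, which means that
$\ln n \ge 1 - \frac{1}{n} = \frac{n-1}{n}$ for all $n > 0$. ■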



Substitution of (5), (15), (17), and the choice of $\varepsilon_t$ and $\gamma_t$ in the
theorem formulation into (14) concludes the proof.
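The inequality of Lemma 9, on which the bound for the middle term rests, can be spot-checked numerically. The sketch below assumes the statement that for x_1 = 0, any further numbers x_2, ..., x_n and any α > 0, the exponentially weighted average of the x_i is at most ln(n)/α; for simplicity it samples only nonnegative x_i, and the function names are hypothetical.

```python
import math
import random


def exp_weighted_average(xs, alpha):
    """Sum of x_i * exp(-alpha x_i) divided by the sum of exp(-alpha x_i)."""
    weights = [math.exp(-alpha * x) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


def lemma9_upper_bound(n, alpha):
    """The claimed upper bound ln(n) / alpha."""
    return math.log(n) / alpha


def largest_observed_ratio(trials, seed=0):
    """Largest ratio of the weighted average to ln(n)/alpha over random instances."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        n = rng.randint(2, 20)
        alpha = rng.uniform(0.1, 50.0)
        # x_1 = 0 plays the role of the best action, whose empirical regret is zero.
        xs = [0.0] + [rng.uniform(0.0, 10.0) for _ in range(n - 1)]
        worst = max(worst, exp_weighted_average(xs, alpha) / lemma9_upper_bound(n, alpha))
    return worst


if __name__ == "__main__":
    print("largest observed ratio:", largest_observed_ratio(trials=10_000))
```

The ratio never exceeds 1 in such runs, which is what the lemma predicts; the random search is only a sanity check and does not replace the proof.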